# Construction of Hal-Data

## Data File Introduction

- Hal-Data 130K SFT.json: The SFT dataset based on the LLaVA style, built upon the Hal-Data foundation.
- in_domain_eval_5k.json: In-domain evaluation data used in the discriminative evaluation of Hal-Eval.
- out_of_domain_eval_5k.json: Out-of-domain evaluation data used in the discriminative evaluation of Hal-Eval.
- Hal-Data 2M: Due to limitations on data upload size, we regret to inform you that we are unable to provide Hal-Data 2M.


## Introduction
Welcome to the introduction of Hal-Data, a remarkable dataset meticulously curated through an automated hallucination annotation pipeline with AFHA. With a staggering collection of 1 million instances, Hal-Data comprises two distinct parts: Hal-Data 130k and Hal-Data -2M. In this markdown, we delve into the intricacies of creating this extraordinary dataset.

### Data Collection: Hal-Data 130k
To ensure unparalleled diversity and comprehensiveness, we embarked on a meticulous data collection endeavor. By gathering around 200K images from diverse sources, including the renowned COCO dataset \cite{lin2014coco} featuring 80K high-quality images within our domain, and incorporating an additional 80K web images from CC \cite{changpinyo2021cc3m12m}, SBU \cite{SBU}, and LAION \cite{schuhmann2022laion}, we laid a solid foundation for the dataset. Additionally, to harmonize with the refined style of LVLM outputs, we handpicked 40K image-text datasets from ShareGPT4-V \cite{Chen2023ShareGPT4V}. Leveraging the power of AFHA's ingenious annotation capabilities, we meticulously annotated this subset, culminating in the creation of Hal-Data 130k.

### Generation: Hal-Data -2M
Eager to expand the scale of our dataset while keeping costs manageable, we devised a creative approach. Fine-tuning the publicly available large-scale language model LLaMA2 13B \cite{LLaMA} through the Hal-Data 130k dataset yielded a novel hallucination data annotation model known as Hal-Annotator. By capitalizing on the Hal-Annotator's training on an extensive and diverse dataset, we were able to generate annotations of exceptional quality that were relevant to the content. This approach allowed us to scale our dataset without relying on the costly GPT-4. We carefully selected 2 million image-caption pairs from existing public datasets and deployed our pre-trained Hal-Annotator to introduce various types of hallucinations, thus annotating and enhancing the image captions.

With Hal-Data, we present to you a meticulously crafted dataset that caters to the needs of diverse language tasks. Its remarkable content and nuanced descriptions provide an invaluable resource for advancing research and development in the field. Explore the intricacies of Hal-Data and experience the magic of language-powered visions.


